Abstract

In molecular systematics, evolutionary trees are reconstructed from sequences at the tips 
under simple models of site substitution. A central question is how much sequence data
is required to reconstruct a tree accurately. The answer depends on the lengths of the
branches (edges) of the tree, with very short and very long edges requiring long sequences 
for accurate tree inference, particularly when these branch lengths are arranged in certain 
ways. For four-taxon trees, the sequence length question has been investigated for the 
case of a rapid speciation event in the distant past. Here, we generalize results from this 
earlier study, and show that the same sequence length requirement holds even when the 
speciation event is recent, provided that at least one of the four taxa is distantly related 
to the others. However, this equivalence disappears if a molecular clock applies, since the 
length of the long outgroup edge becomes largely irrelevant in the estimation of the tree 
topology for a recent divergence. We also discuss briefly some extensions of these results 
to models in which substitution rates vary across sites, and to settings where more than 
four taxa are involved. 
1. Background



Phylogenetic methods are founded on the notion that evolutionary relationships can
be inferred from sequences that have evolved along with the taxa. It is usually supposed
that such sequences evolve according to some continuous-time reversible Markov process,
or a mixture of such processes (for further background on phylogenetic inference, the
reader is referred to Felsenstein (2004)). Here, we are interested in the question of the
sequence length required to accurately estimate a discrete and fundamental parameter of
evolutionary history, namely the topology of the underlying evolutionary tree. This question
has long been of interest in molecular systematics (see, for example, Saitou and Nei
(1986), Churchill et al. (1992) and Lecointre et al. (1994)) and a variety of mathematical
approaches have been explored in order to quantify how much 'phylogenetic information'
sequence data contains (Mossel and Steel (2005), Townsend (2007), Townsend and
Leuenberger (2011) and Townsend et al. (2012)). Although the underlying tree topology
is rooted, phylogenetic models are generally time-reversible, and so methods based on
these models produce trees that are unrooted; accordingly, we will say that a method
correctly reconstructs the tree topology if it does so up to the placement of the root.
Amongst unrooted trees, the simplest phylogenetic problem involves a set of four taxa,
for which there are just three resolved binary tree topologies and one 'star tree'. Fischer
and Steel (2009) investigated the sequence length required to accurately reconstruct a
binary four-taxon phylogenetic tree with four long pendant branches, and a short interior
edge (Fig. 1(a), 1(a')). This special case is motivated by the scenario in evolutionary
biology in which a rapid speciation event in the distant past results in all taxa sitting
on 'long branches' around a short interior edge of length $l_0$. The authors found that the
length of sequence needed to reconstruct the correct four-taxon tree with probability
$1 - \epsilon$ grows at the rate $Ce^{bL}/l_0^2$, where $C$ and $b$ are positive constants and
$L$ is the length of the long pendant edges. Notice the impact on the required sequence
length of a long branch length (i.e. $e^{bL} \to \infty$ as $L \to \infty$) and of a short
interior branch (i.e. $1/l_0^2 \to \infty$ as $l_0 \to 0$), and that these combine
multiplicatively in this lower bound (thus, the cumulative effect of a short branch beside
a long one becomes compounded much more than if the interaction was, say, additive). This
formally justifies the informal notion that a very short interior edge surrounded by long
branches is a particularly challenging phylogenetic problem.

In this paper, we wish to compare this scenario with another that is at least as 
common in evolutionary biology, namely the setting in which only one of the taxa is 
distantly related to the others, being a distant 'outgroup' taxon (see, for example, Fig. 
1(b)). In particular, we ask whether the sequence length requirements are less severe if 
just one pendant edge is long, rather than all four. We show analytically that essentially 
the same bound (of the form $Ce^{bL}/l_0^2$) applies in general.

However, a curious situation develops if one imposes a molecular clock. Doing so
does not affect the exponential dependence on $L$ of the sequence length requirements for
reconstructing the tree in Fig. 1(a'). However, in the case of just one distant outgroup
taxon (Fig. 1(b')), the exponential dependence on $L$ (the long branch) disappears entirely.

Finally, we extend our main result to settings where sites evolve at varying rates 
and we also indicate how the results apply when the four lineages are replaced by four 
monophyletic groups of taxa. 

2. Preliminaries 

We first recall some terminology from phylogenetics. Let $X = \{1,2,3,4\}$ be a set of
four taxa, and let $T_1$, $T_2$ and $T_3$ be the three possible unrooted binary trees that
have $X$ as their leaf set.

Suppose we have a continuous-time, stationary, and time-reversible Markov process on a
state space $G$ that acts at various intensities on the edges of one of these trees. The
length of an edge will refer to the expected number of substitutions on that edge. This is
the substitution rate on that edge, multiplied by the temporal duration of that edge. In the
case where the substitution rate is constant across the tree, we will say that a molecular
clock applies, but we do not assume this unless otherwise stated. Throughout this paper
we will let $l_0$ be the branch length of the interior edge of any four-taxon tree.

Figure 1: In tree (a), a short interior edge is incident with four long pendant edges, representing a rapid 
radiation event deep in the past; tree (a') shows the associated unrooted tree. Tree (b) shows a more 
recent rapid radiation event in which only one of the four incident edges is long, as it joins a distant 
outgroup. Tree (b') shows the associated unrooted tree. 



Let $S = G^X$ be the set of possible assignments of elements of the state space $G$
to the leaf set $X$; we will refer to an element of $S$ as a site pattern. Now, suppose we
generate $k$ site patterns independently according to the same Markov process to form
sequences of length $k$ (one sequence for each taxon). It is well known that for any set
of (positive) branch lengths on $T_i$, one can correctly recover the topology $T_i$ from these
sequences with a probability of at least $1 - \epsilon$ for sufficiently large values of $k$
(for a discussion of statistically consistent tree reconstruction, see Felsenstein (2004)).
Here 'sufficiently large' depends not just on $\epsilon$ but also on the tree and its
associated branch lengths.



As in Fischer and Steel (2009), our arguments rely on the properties of the Hellinger
distance ($d_H$), which is defined as follows: Given a finite set $U$, the Hellinger distance
$d_H(p, q)$ between two probability distributions $p$ and $q$ on $U$ is defined by the equation:

(1) $d_H^2(p,q) = \sum_{u \in U} \left(\sqrt{p_u} - \sqrt{q_u}\right)^2 = 2\left(1 - \sum_{u \in U} \sqrt{p_u q_u}\right)$.

Hellinger distances are useful to quantify the amount of data required to accurately
identify a discrete parameter in a stochastic model. We will describe this below (Lemma 2.1)
in a general setting, not specific to phylogenetics.
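
As a small numerical illustration of the definition in Eqn. (1), the following Python sketch
evaluates both forms of the squared Hellinger distance. The function names and the two
example distributions are hypothetical choices made only for this illustration.

```python
import math

def hellinger_sq(p, q):
    """First form in Eqn. (1): sum over u of (sqrt(p_u) - sqrt(q_u))^2."""
    return sum((math.sqrt(pu) - math.sqrt(qu)) ** 2 for pu, qu in zip(p, q))

def hellinger_sq_affinity(p, q):
    """Second form in Eqn. (1): 2 * (1 - sum over u of sqrt(p_u * q_u))."""
    return 2.0 * (1.0 - sum(math.sqrt(pu * qu) for pu, qu in zip(p, q)))

# Two hypothetical probability distributions on a three-element set U.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(hellinger_sq(p, q), hellinger_sq_affinity(p, q))
```

The two forms agree whenever $p$ and $q$ each sum to 1; the squared distance is 0 for
identical distributions and reaches its maximum value of 2 when their supports are disjoint.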

2.1. Hellinger distance bounds on required sequence length 

Let $A$ and $U$ be finite sets, and suppose that each element $a \in A$ defines a probability
distribution $p_a$ on $U$. We will denote the Hellinger distance between $p_a$ and $p_b$ by
$d_H(a, b)$, and, by slight abuse of terminology, refer to it as the 'Hellinger distance
between $a$ and $b$'.

Suppose an element $\xi$ of $A$ is selected according to some discrete non-zero probability
distribution on $A$. Conditional on $\xi = a$, consider a sequence of $k$ samples from $U$
generated independently according to the probability distribution $p_a$. Let
$M : U^k \to A$ be some method for estimating the element $a \in A$ from a sequence
$(u_1, u_2, \ldots, u_k) \in U^k$ (here $M$ may be a deterministic function from $U^k$ to $A$
or a process that selects an element of $A$ from each element of $U^k$ according to some
probability distribution - the latter case allows ties to be broken randomly).

Let $r_a^{(M,k)}$ denote the probability that the method $M$ correctly identifies the element
$a$ that generates the sequence $(u_1, u_2, \ldots, u_k) \in U^k$ under the probability
distribution $p_a$. In other words:

$r_a^{(M,k)} = \mathbb{P}(M(u_1, u_2, \ldots, u_k) = a \mid \xi = a)$.



The following lemma is from Steel and Székely (2002) (Theorem 3.1 and (2.7)).



Lemma 2.1 Given finite sets $U$ and $A$, suppose that elements of $U$ are generated i.i.d.
by some unknown element $\xi \in A$. Then for any estimation method $M$ that satisfies
$r_a^{(M,k)} \geq 1 - \epsilon$ for all $a \in A$, we must have

$k \geq C_\epsilon / d^2$,

where $C_\epsilon$ is a positive constant multiple of $(1 - 2\epsilon)^2$, and
$d = \min\{d_H(a, a') : a, a' \in A; a \neq a'\}$.



In our setting, $A$ will consist of a set of phylogenetic trees on the leaf set
$X = \{1, 2, 3, 4\}$ and $U$ will be the set $S$ of assignments of states to the elements
of $X$. We will use the lemma to prove a lower bound on $k$ in the following section.

3. A general lower bound on the required sequence length 

We now present a lower bound for the necessary sequence length k required to re- 
construct a tree of the type shown in Fig. 1(b). This lower bound is essentially of the 
same form as that which applies when all four pendant branches are long - namely, it 
grows exponentially with the length L of the long branch and in inverse proportion to 
the square of the short interior branch, and these factors combine multiplicatively. 

Theorem 3.1 Consider the three-leaf star tree on the taxon set $\{1, 2, 3\}$ with
corresponding branch lengths $l_1, l_2, l_3 > \delta > 0$. Suppose that a fourth taxon is
attached by a branch of length $L > 0$ to one of the three branches at a distance
$l_0 \in (0, \delta)$ from the interior node. Generate $k$ i.i.d. site patterns at the tips of
the resulting four-taxon tree under a finite-state, stationary and irreducible Markov
process. Then, any method that is able to correctly identify with probability at least
$1 - \epsilon$ which branch the fourth branch is grafted onto requires:

(2) $k \geq C e^{bL} / l_0^2$,

where $C$ is a constant (independent of $l_0$ and $L$) that depends on $\epsilon$, $\delta$
and the rate parameters of the Markov process.
Moreover, some methods achieve this accuracy using sequences with a length that is
no more than a constant times $e^{bL}/l_0^2$.




Figure 2: The three-taxon and four-taxon trees described in the statement and proof of Theorem 3.1 
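
The dependence on $L$ and on the graft distance described in Theorem 3.1 can be explored
numerically. The sketch below is only an illustration under stated assumptions: it uses the
two-state symmetric model (a special case of the processes covered by the theorem), the
branch lengths are arbitrary, and all function names are hypothetical. It computes the
squared Hellinger distance between the site-pattern distributions obtained by grafting
taxon 4 onto branch 3 and onto branch 2.

```python
import itertools
import math

def trans(a, b, t):
    """Two-state symmetric model: P(end state b | start state a) on a branch of length t."""
    same = 0.5 * (1.0 + math.exp(-2.0 * t))
    return same if a == b else 1.0 - same

def pattern_probs(target, l, l0, L):
    """Site-pattern distribution when taxon 4 is grafted onto branch `target` (0, 1 or 2).

    l = (l1, l2, l3) are the star-tree branch lengths, l0 is the distance of the
    graft point v from the central vertex u, and L is the length of the new branch.
    """
    others = [i for i in range(3) if i != target]
    probs = {}
    for s in itertools.product((0, 1), repeat=4):
        total = 0.0
        for xu in (0, 1):        # state at the central vertex u
            for xv in (0, 1):    # state at the graft point v
                term = 0.5 * trans(xu, xv, l0)
                for i in others:
                    term *= trans(xu, s[i], l[i])
                term *= trans(xv, s[target], l[target] - l0)
                term *= trans(xv, s[3], L)
                total += term
        probs[s] = total
    return probs

def hellinger_sq(p, q):
    """Squared Hellinger distance between two distributions on the same patterns."""
    return sum((math.sqrt(p[s]) - math.sqrt(q[s])) ** 2 for s in p)

def d2_between_graftings(l, l0, L):
    """Squared Hellinger distance: graft onto branch 3 versus graft onto branch 2."""
    return hellinger_sq(pattern_probs(2, l, l0, L), pattern_probs(1, l, l0, L))

l = (0.5, 0.5, 0.5)  # arbitrary illustrative star-tree branch lengths
for L in (1.0, 2.0, 3.0):
    print(L, d2_between_graftings(l, 0.1, L))
```

In this example the distance falls quickly as $L$ grows and also falls as the graft distance
is reduced, which is the qualitative behaviour that drives the lower bound on $k$.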



Proof: To establish Inequality (2) as a lower bound, we first derive an upper bound
for the Hellinger distance between $T_1$ and the tree $T^*$, formed by grafting the fourth
branch directly onto vertex $u$ of the three-taxon star tree, as shown in Fig. 2. By the
triangle inequality, we have:

(3) $d_H(T_1, T_2) \leq d_H(T_1, T^*) + d_H(T^*, T_2)$.

Most of the proof is devoted to establishing the following inequality (for a constant
$B = B(\delta)$):

(4) $d_H^2(T_1, T^*) \leq B l_0^2 e^{-bL}$.
To establish Inequality (4), let $p_s$ and $p_s^*$ denote the probability of generating the
site pattern $s$ on $T_1$ and $T^*$, respectively. Let $N$ denote the number of substitutions
occurring along the edge between $u$ and $v$ (the interior edge of $T_1$), and let
$\tau = \mathbb{P}(N > 0)$. Further, let $Q_s$ (respectively $Q_s^*$) denote the conditional
probability of generating pattern $s$ on $T_1$ (respectively, on $T^*$) given that $N = 0$.
Similarly, let $P_s$ (respectively $P_s^*$) denote the conditional probability of generating
pattern $s$ on $T_1$ (respectively, on $T^*$) given $N > 0$. Then, by the law of total
probability, we can write $p_s$ and $p_s^*$ as:

(5) $p_s = (1 - \tau) \cdot Q_s + \tau \cdot P_s$,
    $p_s^* = (1 - \tau) \cdot Q_s^* + \tau \cdot P_s^*$.



Let $D_s = P_s - P_s^*$. Then, since $Q_s = Q_s^*$, we have:

(6) $p_s - p_s^* = \tau (P_s - P_s^*) = \tau D_s$.

Now, the Hellinger distance between $T_1$ and $T^*$ (on site patterns) is:

(7) $d_H^2(T_1, T^*) = 2\left(1 - \sum_{s \in S} \sqrt{p_s p_s^*}\right)$.



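Following the approach of Fischer and Steel (2009) (Lemma 5.1), substituting
$p_s^* = p_s - \tau D_s$ (from Eqn. (6)) into (7) gives:

$d_H^2(T_1, T^*) = 2\left(1 - \sum_{s \in S} p_s \sqrt{1 - \frac{\tau D_s}{p_s}}\right)$,

and then the application of the inequality $\sqrt{1 + y} \geq 1 + y/2 - y^2/2$ for
$y \geq -1$ (together with $\sum_{s \in S} D_s = 0$) leads to the following inequality:

(8) $d_H^2(T_1, T^*) \leq \tau^2 \cdot \sum_{s \in S} \frac{D_s^2}{p_s}$.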

Now, $\tau = \mathbb{P}(N > 0) \leq \mathbb{E}(N)$, so we have $\tau \leq l_0$, and thus we
can replace $\tau$ in (8) by $l_0$ to obtain:

(9) $d_H^2(T_1, T^*) \leq l_0^2 \cdot \sum_{s \in S} \frac{D_s^2}{p_s}$.

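Referring again to Fig. 2, let $\chi_u$ be the character state on vertex $u$, let $\chi_v$ be
the character state on vertex $v$, let $\chi_1, \ldots, \chi_4$ be the respective character
states on the leaf set $\{1,2,3,4\}$, and let $p(s', s'')$ denote the conditional probability
$\mathbb{P}(\chi_u = s', \chi_v = s'' \mid N > 0)$.

Then, writing $s = (s_1, s_2, s_3, s_4)$, we have:

$P_s = \sum_{s', s'' \in G} p(s', s'') \mathbb{P}(\chi_1 = s_1 \mid s') \mathbb{P}(\chi_2 = s_2 \mid s') \mathbb{P}(\chi_3 = s_3 \mid s'') \mathbb{P}(\chi_4 = s_4 \mid s'')$,

and

$P_s^* = \sum_{s', s'' \in G} p(s', s'') \mathbb{P}(\chi_1 = s_1 \mid s') \mathbb{P}(\chi_2 = s_2 \mid s') \mathbb{P}(\chi_3 = s_3 \mid s'') \mathbb{P}(\chi_4 = s_4 \mid s')$,

where $\mathbb{P}(\chi_i = s_i \mid x)$ (for $x = s'$ or $s''$) is the probability of generating
leaf state $s_i$ at leaf $i$ conditional on state $x$ at the vertex of the tree adjacent to
leaf $i$.
Thus, $|D_s| = |P_s - P_s^*|$ is bounded above as follows:

(10) $|D_s| \leq \sum_{s', s'' \in G} p(s', s'') \mathbb{P}(\chi_1 = s_1 \mid s') \mathbb{P}(\chi_2 = s_2 \mid s') \mathbb{P}(\chi_3 = s_3 \mid s'') \cdot \left|\mathbb{P}(\chi_4 = s_4 \mid s'') - \mathbb{P}(\chi_4 = s_4 \mid s')\right|$.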

We now invoke the property that any irreducible Markov process converges to its
stationary distribution at an exponential rate regardless of its starting state (cf. Theorem
8.3 of Rozanov (1969)). Specifically, if $Y_t$ is the state of such a process when it is run
for duration $t$ then, for any state $s$ with the equilibrium frequency $\pi(s)$, and any
second state $\alpha$, we have:

(11) $|\mathbb{P}(Y_t = s \mid Y_0 = \alpha) - \pi(s)| \leq A e^{-at}$,

where $A$ and $a$ depend only on the rate parameters of the Markov process. Using the
triangle inequality, Inequality (11) gives

(12) $|\mathbb{P}(\chi_4 = s_4 \mid s') - \mathbb{P}(\chi_4 = s_4 \mid s'')| \leq 2A e^{-aL}$.

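Substituting Eqn. (12) into Eqn. (10), we obtain:

(13) $\frac{D_s^2}{p_s} \leq \frac{4A^2 e^{-2aL}}{p_s} \left[\sum_{s', s'' \in G} p(s', s'') \mathbb{P}(\chi_1 = s_1 \mid s') \mathbb{P}(\chi_2 = s_2 \mid s') \mathbb{P}(\chi_3 = s_3 \mid s'')\right]^2$,

and since the term in brackets is bounded above by $1 \, (= \sum_{s', s'' \in G} p(s', s''))$,
we obtain:

(14) $\frac{D_s^2}{p_s} \leq \frac{4A^2 e^{-2aL}}{p_s}$.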



Note also that, since $l_1, l_2, l_3 > \delta > 0$ and $L > 0$, and the Markov process is
irreducible, there is some positive $\rho = \rho(\delta)$ such that $p_s \geq \rho$. We can
thus further reduce Inequality (14) to:

(15) $\frac{D_s^2}{p_s} \leq \frac{4A^2 e^{-2aL}}{\rho}$.



Substituting Eqn. (15) into Eqn. (9) now furnishes the promised justification of Eqn.
(4), upon taking $b = 2a$ and $B = B(\delta) = 4A^2/\rho(\delta)$. By symmetry, Eqn. (4)
gives us the same upper bound on $d_H^2(T_2, T^*)$ as for $d_H^2(T_1, T^*)$. We then have, by
the triangle inequality:

(16) $d_H^2(T_1, T_2) \leq \left(d_H(T_1, T^*) + d_H(T^*, T_2)\right)^2 \leq 4B l_0^2 e^{-bL}$.

The first part of Theorem 3.1 follows from Lemma 2.1 by taking $A = \{T_1, T_2\}$ (so that
$C_\epsilon$ is the constant multiple of $(1 - 2\epsilon)^2$ given there) and then setting
$C = C_\epsilon/(4B)$.

Finally, the last claim in Theorem 3.1 (that $e^{bL}/l_0^2$ is an upper bound on the required
sequence length, up to a constant multiplicative factor) is provided by Theorem 14 of
Erdős et al. (1999).

□

3.1. Imposing a (relaxed) molecular clock 

When a molecular clock is imposed, there is an interesting shift in the sequence
length requirements for accurate tree reconstruction. Although we have seen that the two
scenarios in Fig. 1 lead to the same type of lower-bound dependence of sequence length
on $l_0$ and $L$, namely $\exp(cL)/l_0^2$, if we impose a molecular clock, then this
equivalence disappears. More precisely, it is clear (from Fischer and Steel (2009)) that the
term $\exp(cL)/l_0^2$ remains for the deep divergence set-up of our Fig. 1(a), but for the
recent divergence event shown in Fig. 1(b) we will show that the length of the long edge $L$
is largely irrelevant.

We need to stress here how this result should be interpreted. We are not claiming that 
if a clock applies in the tree that generates the data, then every consistent model-based 
method, such as maximum likelihood, will be immune to the effect of a long branch to an 
outgroup. It will not be so immune if, in the model assumed in the maximum likelihood 
analysis, a molecular clock is not imposed. We are merely claiming that certain methods 
(such as agglomerative clustering, or MLE with a clock) can be immune to a long branch 
if a clock assumption applies. 

We formalize this by a result, in which the full strength of the molecular clock
condition can be relaxed slightly. Note that for the tree in Fig. 1(b), under a strict
molecular clock the branch lengths (as indicated in Fig. 2) must satisfy:

$l_1 = l_2$; $l_3 = l_1 + l_0 < L$.

We relax this slightly by requiring only that:

(17) $\min\{l_3, L\} \geq \max\{l_1, l_2\} + l_0$.

Theorem 3.2 Consider the tree in Fig. 1(b) and suppose that the branch lengths (as
indicated in Fig. 2) satisfy the relaxed clock condition described in (17). Let $k$ sites
evolve i.i.d. along this tree under a finite-state, stationary and reversible Markov process.
Then the placement of the branch leading to taxon 4 can be determined correctly with
probability at least $1 - \epsilon$ provided that:

(18) $k \geq B / (1 - e^{-\lambda l_0})^2$,

where $B$ depends just on $l_1$, $l_2$, the model and $\epsilon$, and where $\lambda$ is a
constant determined by the model. In particular, this bound is independent of the length $L$
of the long branch to the outgroup taxon 4.

Proof: Consider the following simple reconstruction method (this is, essentially, unrooted
UPGMA on four taxa). Let $s(x, y)$ denote the proportion of sequence sites for which taxa
$x$ and $y$ have the same state. Select the two taxa that maximize $s(x, y)$ and return the
(unrooted) quartet tree in which $x$ and $y$ form a cherry. Let $e(x, y)$ be the expected
value of $s(x, y)$. If $Y_t$ ($t \geq 0$) denotes the Markov process described in the
statement of Theorem 3.2 then we have:

$e(x, y) = \mathbb{E}[s(x, y)] = \sum_i \pi_i \mathbb{P}(Y_t = i \mid Y_0 = i)$,

where in this equation the value $t = t_{xy}$ refers to the branch length distance between
taxa $x$ and $y$. By the spectral representation of reversible continuous-time Markov
processes (see e.g. Chapter 3, Eqn (40) of Aldous and Fill (2010)) we have, for any state $i$:

$\mathbb{P}(Y_t = i \mid Y_0 = i) = \pi_i + \sum_{m \geq 2} u_{im}^2 e^{-\lambda_m t}$,

where $\pi_i$ is the equilibrium frequency of state $i$, $\lambda_m \geq 0$ are the
eigenvalues of the rate matrix multiplied by $-1$, and the $u_{im}$ values are real
coefficients related to the eigenvectors of the rate matrix. The $\lambda_m$ values can be
ordered $0 = \lambda_1 < \lambda_2 \leq \lambda_3 \leq \cdots$. Consequently,
$e(x, y) = \sum_i \pi_i^2 + \sum_{m \geq 2} c_m e^{-\lambda_m t}$, where
$c_m = \sum_i \pi_i u_{im}^2 \geq 0$, and so:

$e(x, y) - e(x', y') = \sum_{m \geq 2} c_m \left(e^{-\lambda_m t_{xy}} - e^{-\lambda_m t_{x'y'}}\right)$.

Since the coefficient $c_2$ is strictly positive, if $t_{x'y'} - t_{xy} \geq l_0$ we can write:

(19) $e(x, y) - e(x', y') \geq c_2 e^{-\lambda t_{xy}} (1 - e^{-\lambda l_0}) > 0$,

where, for convenience, we let $\lambda$ denote $\lambda_2$. Notice that
$t_{12} = l_1 + l_2$, $t_{13} = l_1 + l_3$ and $t_{23} = l_2 + l_3$ and so, by the relaxed
clock condition (17), we have $t_{23} - t_{12} \geq l_0$ and $t_{13} - t_{12} \geq l_0$.
Thus (19) holds for $(x, y) = (1, 2)$ and $(x', y') = (1, 3), (2, 3)$.



Next, if we let $X_{12;3} = s(1, 2) - s(1, 3)$, then observe that:

(20) $\mathbb{P}(s(1, 2) < s(1, 3)) = \mathbb{P}(X_{12;3} < 0) = \mathbb{P}(X_{12;3} - \mathbb{E}[X_{12;3}] < -\mathbb{E}[X_{12;3}])$.



In order to exhibit an upper bound on this probability, we will apply McDiarmid's inequality
(McDiarmid 1989). First, observe that we can express $s(1, 2) - s(1, 3)$ as $1/k$ times a sum
of $k$ independent random variables (one for each site), each taking a value of $+1$, $0$ or
$-1$, and this quantity has the property that changing any one of these variables (while
keeping the others fixed) alters $s(1, 2) - s(1, 3)$ by an additive term whose absolute value
is at most $2/k$. Applying the McDiarmid inequality, noting that
$e(1, 2) - e(1, 3) \geq c_2 e^{-\lambda t_{12}} (1 - e^{-\lambda l_0})$
from Inequality (19), we obtain from Eqn. (20) that:

$\mathbb{P}(s(1, 2) < s(1, 3)) \leq \exp\left(-k c_2^2 e^{-2\lambda t_{12}} (1 - e^{-\lambda l_0})^2 / 2\right)$,

and this can be made less than or equal to $\epsilon/5$ whenever Inequality (18) is satisfied
for $B = 2\log(5/\epsilon) / (c_2^2 e^{-2\lambda t_{12}})$. By symmetry,
$\mathbb{P}(s(1, 2) < s(2, 3))$ is also less than or equal to $\epsilon/5$ for this value of $k$.
Moreover, by the relaxed clock condition, we also have:

$\mathbb{P}(s(1, 2) < s(x, 4)) \leq \epsilon/5$ for $x = 1, 2, 3$.

Thus, with probability at least 1 — e, the pair {1,2} will have the strictly largest s- 
value; consequently, the correct tree topology will be recovered by the method described 
with probability at least 1 — e. 

□ 
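
The reconstruction method used in the proof, together with the sequence length implied by
the McDiarmid bound, can be sketched in a few lines of Python. This is an illustrative
sketch only: the function names are hypothetical, and the default values of `c2` and `lam`
are those of the two-state symmetric model, for which $e(x, y) = \frac{1}{2} + \frac{1}{2}e^{-2t}$
(an assumption made for the example, not a requirement of Theorem 3.2).

```python
import itertools
import math

def similarity(a, b):
    """s(x, y): proportion of sites at which two aligned sequences agree."""
    return sum(ca == cb for ca, cb in zip(a, b)) / len(a)

def quartet_by_similarity(seqs):
    """Return the quartet split in which the most similar pair of taxa forms a cherry.

    seqs maps four taxon labels to aligned sequences of equal length.
    Ties are resolved in favour of the first maximal pair in sorted order.
    """
    taxa = sorted(seqs)
    pairs = itertools.combinations(taxa, 2)
    x, y = max(pairs, key=lambda pr: similarity(seqs[pr[0]], seqs[pr[1]]))
    rest = tuple(t for t in taxa if t not in (x, y))
    return (x, y), rest

def sites_needed(eps, t12, l0, c2=0.5, lam=2.0):
    """Smallest integer k with exp(-k * gap**2 / 2) <= eps / 5.

    gap is the lower bound on e(1,2) - e(1,3) from Inequality (19);
    note that the long branch length L does not enter.
    """
    gap = c2 * math.exp(-lam * t12) * (1.0 - math.exp(-lam * l0))
    return math.ceil(2.0 * math.log(5.0 / eps) / gap ** 2)

# Hypothetical toy alignment: taxa 1 and 2 differ at a single site.
seqs = {1: "AACCGGTT", 2: "AACCGGTA", 3: "ACGTACGT", 4: "TTTTGGGG"}
print(quartet_by_similarity(seqs))
print(sites_needed(eps=0.05, t12=0.2, l0=0.05))
```

The design mirrors the proof: only pairwise agreement proportions are used, and the
required number of sites depends on $t_{12}$, $l_0$, $\epsilon$ and the model constants.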

4. Further extensions and concluding comments 

4.1. Rates across sites

When sites evolve i.i.d. the sequence length required to reconstruct the tree in Fig.
1(a) accurately grows exponentially with the length $L$ of the long exterior branches;
the same holds also for the tree in Fig. 1(b) in the absence of any molecular-clock
assumption (Theorem 3.1). We point out that these conclusions need not hold when the
sites evolve independently but not identically under a model that allows substitution rates
to vary across sites, provided this rate distribution allows arbitrarily small rates, and with
appropriate density. Suppose, for example, that site $i$ has rate $r_i = \frac{1}{i}$ for
$i = 1, 2, \ldots$. Let $T'$ be either an alternative binary tree to $T_1$ or the unresolved
tree (i.e. $T' = T_2$ or $T^*$), and let $D_H^2(T_1, T')$ be the squared Hellinger distance
between sequences of length $k$ generated by $T_1$ and $T'$ in which the rate at site $i$ of
the Markov process is $r_i$. We claim that for a sequence length that grows at the
(polynomial) rate $L^5$, the value $D_H^2(T_1, T')$ converges to 2 as $L$ tends to infinity.
We first establish this claim and then explain why it implies that one can reconstruct the
generating tree ($T_1$) from sequences of a length that is polynomial in $L$.

By a standard equality relating the Hellinger distance of sequences of independent samples
to the Hellinger distances at each sequence site (easily derived from Eqn. (1)) we have:

(21) $D_H^2(T_1, T') = 2\left(1 - \prod_{i=1}^{k} \left(1 - \tfrac{1}{2} d_i^2\right)\right)$,

where $d_i$ is the Hellinger distance between the probability distributions on patterns at
site $i$ (and rate $r_i$) generated by tree $T_1$ and generated by $T'$. This applies in either
the setting of Figure 1(a) (four long pendant edges) or Figure 1(b) (one long pendant edge).
Moreover, by definition,

(22) $d_i^2 \geq \left(\sqrt{p_i} - \sqrt{p_i'}\right)^2$



where $p_i$ here refers to the probability of generating a site pattern with leaves 1, 2 in
one state (say A) and leaves 3, 4 in a different state (say B) on $T_1$ at substitution rate
$r_i$, while $p_i'$ is the corresponding probability for this same site pattern when $T_1$ is
replaced by $T'$. Notice that this site pattern can be generated by a state change on just
one edge of $T_1$ (the central edge), while on $T'$ at least two pendant edges require state
changes. Thus, for suitable constants $c, c'$, for the tree in Fig. 1(a) we have
$p_i \geq \frac{c}{i}$ and $p_i' \leq c'\left(\frac{L}{i}\right)^2$; while for the tree in
Fig. 1(b) we have $p_i \geq \frac{c}{i}$ and
$p_i' \leq c'\left(\frac{1}{i}\right)\left(\frac{L}{i}\right)$. Thus, in either case, provided
$i$ is sufficiently large, Inequality (22) gives

(23) $d_i^2 \geq \frac{d}{i}(1 - o(1))$,

for a positive constant $d$, and where $o(1)$ denotes a term that converges to 0 with
increasing $i$. Then, by combining Eqn. (21) and Eqn. (23), we have:

$D_H^2(T_1, T') \geq 2\left(1 - \prod_{i=L^4}^{L^5} \left(1 - \tfrac{1}{2} d_i^2\right)\right) \geq 2\left(1 - \prod_{i=L^4}^{L^5} \left(1 - \frac{d(1 - o(1))}{2i}\right)\right)$,

and straightforward asymptotic analysis of the last term reveals that
$D_H^2(T_1, T') \to 2$ as $L \to \infty$.
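
The asymptotic claim can be checked numerically for the leading term of the product. The
sketch below is an illustration under simplifying assumptions: the $o(1)$ correction is
dropped, the constant $d$ is set to 1 arbitrarily, and the function name is hypothetical.
Since $\sum_{i=L^4}^{L^5} 1/i$ is close to $\log L$, the product behaves roughly like
$L^{-d/2}$, so the bound approaches 2, although slowly.

```python
import math

def hellinger_lower_bound(L, d=1.0):
    """Evaluate 2 * (1 - prod_{i=L^4}^{L^5} (1 - d / (2 i))) for an integer L >= 2.

    The product is accumulated in log space for numerical stability.
    """
    log_prod = sum(math.log1p(-d / (2.0 * i)) for i in range(L ** 4, L ** 5 + 1))
    return 2.0 * (1.0 - math.exp(log_prod))

for L in (2, 4, 8):
    print(L, hellinger_lower_bound(L))
```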

Finally, we invoke an inequality (Theorem 3.2) from Steel and Székely (2002). If
$M$ = MLE (maximum likelihood estimation) then for $A = \{T_1, T'\}$, the probability that
MLE correctly reconstructs the generating tree from $A$ is at least
$\frac{1}{2} D_H^2(T_1, T')$ and this converges to 1 as $L$ grows (with $k$ growing at the
rate $L^5$).

4.2. Breaking up long edges by adding more taxa

Based on simulation studies and qualitative understanding, it is received wisdom that
long branches are untrustworthy due to the long branch being able to 'go anywhere'.
Hence biologists seek to break up this branch either with more characters or more taxa
(see, for example, Felsenstein (2004) or Graybeal (1998)). One could then reconstruct a
phylogenetic tree for this 'extended' set of taxa and then ignore all but the few taxa one
is interested in.

However, as we add more taxa, the number of possible phylogenetic trees grows ex- 
ponentially, and more data are required to reconstruct a larger tree correctly (this can 
easily be seen by a purely counting argument). Thus it is not immediately clear whether 
this strategy has any formal basis for improving accuracy. Here, we show that the se- 
quence length requirements for resolving a four-taxon tree that has one long branch can 
be exponentially (or even double-exponentially) greater than those of the large tree in 
certain ideal situations. 

To see this, suppose we have one of the types of trees shown in Fig. 1, with one or
more long pendant edges of length $L$. Suppose one can find a set of $N$ additional taxa
so that each edge in the resulting tree has a branch length that lies between fixed values,
say $l$ and $l'$ (with $l < l_0 < l'$). Reconstructing this larger tree accurately requires
just some constant times $\log(N)/l^2$ sites under a two-state symmetric model (see
Daskalakis et al. (2011)) provided $l'$ lies below a critical transition value, while
reconstructing the four-taxon tree involves a term ($e^{cL}$) that grows exponentially with
the length $L$ of any long edge (by Theorem 3.1).



The significance of this result hinges on the following question: how does $\log(N)$
compare with $e^{cL}$? If very short branches are attached at equally spaced intervals along
the long pendant branch (or branches), then $N$ grows in a linear relationship with $L$.
In this case, the sequence length required to reconstruct the four-taxon tree is doubly
exponential in the sequence length required to reconstruct the much larger tree, as $L$
grows (moreover, this does not require the strong technical result from Daskalakis et al.
(2011) but a weaker though more generally applicable result from Erdős et al. (1999)).

However, it would be more realistic to constrain the branch lengths in the tree to be
approximately clocklike. In that case, $N$ need only be of order $2^{dL}$ for some constant
$d$; $\log(N)$ would then be proportional to $L$ and the sequence length required to
reconstruct the four-taxon tree would be exponential in the sequence length required to
reconstruct the much larger tree as $L$ grows.
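
The two growth regimes can be compared with a few lines of arithmetic. In the sketch below
every constant is a hypothetical placeholder (the unknown multiplicative constants are
omitted, $c = d = 1$, the shortest branch length is 0.1, and ten taxa per unit length are
assumed in the evenly spaced case), so only the relative growth of the printed values is
meaningful.

```python
import math

def four_taxon_sites(L, c=1.0):
    """Exponential factor e^{cL} in the four-taxon bound of Theorem 3.1."""
    return math.exp(c * L)

def large_tree_sites(N, l=0.1):
    """Order of growth log(N) / l^2 for the large tree (constant factor omitted)."""
    return math.log(N) / l ** 2

for L in (5, 10, 20):
    n_linear = 10 * L   # taxa attached at evenly spaced points along the long branch
    n_clock = 2 ** L    # approximately clocklike case with d = 1
    print(L, four_taxon_sites(L), large_tree_sites(n_linear), large_tree_sites(n_clock))
```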

In this analysis we are, of course, assuming the most ideal situation, where the taxa are 
distributed as favorably as possible to allow the large tree to be reconstructed; still, it is 
interesting to note that this route - constructing a large tree accurately, then ignoring the 
majority of taxa to consider just the induced phylogenetic relationship between four taxa 
- can require much shorter sequence lengths to achieve the same accuracy (and this holds
for statistically consistent tree reconstruction methods, not just for inconsistent methods 
that can be 'misled' by long branches). 



4.3. Extension of Theorem 3.1 to trees with more taxa

Finally, we discuss what happens to our main results concerning four-taxon trees if
we replace one or more of the four leaves of the tree by subtrees. Firstly, the lower bound
on $k$ given by Theorem 3.1 still applies if the $l_i$ values refer to the lengths of the
central three edges. This is because the sequences at the roots of the four subtrees screen
off the states of the leaves from the random variable that is the topology $T$ of the central
part of the tree (by the Markov property). More formally, consider the following two data sets:

• the sequences Z at the leaves of the tree; 

• the sequences Y at the roots of the four subtrees; 



Since $T \to Y \to Z$ is a Markov chain, the 'data processing inequality' (Cover and Thomas
(1991)) ensures that $I(T; Z) \leq I(T; Y)$, where $I$ refers to mutual information. In other
words, the information that the leaves of the tree tell us about $T$ (the topology of the
central part of tree) cannot exceed the information that the ancestral sequences at the
roots of those subtrees provide about $T$ (were these known; recall that we only observe
sequences at the leaves of the tree). Thus we obtain a conservative lower bound on the
required sequence length with these considerations.

However, a tighter bound would presumably take into account how much uncertainty 
there is in the state at the root of one of the four subtrees, given the states we observe 
at the leaves of that tree. 

To simplify the discussion here, consider just the symmetric two-state model of site
substitution. In this case, let $p_i$ denote the probability of accurately inferring the root
state of a subtree that stands in place of taxon $i$ from the states at the leaves under
maximum likelihood (we assume that the topology and branch lengths of the subtree are
known). If $l_i$ is the length of the central branch of $T$ that is incident with the root of
this subtree, then the probability of a substitution across the endpoints of this edge is
$p_i' = \frac{1}{2}(1 - e^{-2l_i})$.
This suggests the possibility of approximating the sequence length required to resolve
a polytomy in a large tree by replacing each of the four incident subtrees by a single
taxon, with a net probability of substitution across branch $i$ being set to
$p_i \cdot p_i'$. Thus we have replaced a phylogenetic tree with four subtrees by a
four-taxon tree, in which the central edge is of the same length but the pendant edges have
been 'lengthened' to allow for the loss of information that the leaves provide concerning the
root state of each subtree. A natural candidate for this 'effective branch length' of branch
$i$ would be a value of $l$ for which $p_i \cdot p_i' = \frac{1}{2}(1 - e^{-2l})$; this has
the solution: $l = l_i + \frac{1}{2}\log(1/(1 - 2p_i))$. It may be interesting to explore this
approach further since the computation and behavior of the expected root-state
reconstruction probability ($p_i$) have been analyzed already by a number of authors
(e.g. Evans et al. (2000), and Ma and Zhang (2011)).